This guided practical demonstrates how the tidyverse lets you compute summary statistics and visualize datasets efficiently. The dataset used here is already stored in a tidy tibble; cleaning steps will come in future practicals.
datasauRus package
Check that datasauRus and tidyverse are installed by loading them:
library(datasauRus)
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✓ ggplot2 3.3.3 ✓ purrr 0.3.4
## ✓ tibble 3.1.2 ✓ dplyr 1.0.6
## ✓ tidyr 1.1.3 ✓ stringr 1.4.0
## ✓ readr 1.4.0 ✓ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
If the error there is no package called ‘datasauRus’ appears, it means that the package needs to be installed. Use this:
# install.packages("datasauRus")
Since we are dealing with a tibble, we can just type its name:
datasaurus_dozen
## # A tibble: 1,846 x 3
## dataset x y
## <chr> <dbl> <dbl>
## 1 dino 55.4 97.2
## 2 dino 51.5 96.0
## 3 dino 46.2 94.5
## 4 dino 42.8 91.4
## 5 dino 40.8 88.3
## 6 dino 38.7 84.9
## 7 dino 35.6 79.9
## 8 dino 33.1 77.6
## 9 dino 29.0 74.5
## 10 dino 26.2 71.4
## # … with 1,836 more rows
Only the first 10 rows are displayed.
Check the dimensions of the tibble with dim(), ncol() and nrow():
dim(datasaurus_dozen)
## [1] 1846 3
ncol(datasaurus_dozen)
## [1] 3
nrow(datasaurus_dozen)
## [1] 1846
tibble(datasaurus_dozen)
## # A tibble: 1,846 x 3
## dataset x y
## <chr> <dbl> <dbl>
## 1 dino 55.4 97.2
## 2 dino 51.5 96.0
## 3 dino 46.2 94.5
## 4 dino 42.8 91.4
## 5 dino 40.8 88.3
## 6 dino 38.7 84.9
## 7 dino 35.6 79.9
## 8 dino 33.1 77.6
## 9 dino 29.0 74.5
## 10 dino 26.2 71.4
## # … with 1,836 more rows
Assign datasaurus_dozen to the ds_dozen name. This aims at populating the Global Environment.
ds_dozen <- datasaurus_dozen
The ds_dozen object now appears in the Global Environment.
You want to count the number of unique elements in the column dataset. The function length() returns the length of a vector, such as the vector of unique elements returned by unique().
unique(ds_dozen$dataset) %>% length()
## [1] 13
# n_distinct counts the unique elements in a given vector.
# we use summarise to return only the desired column named n here.
summarise(ds_dozen, n = n_distinct(dataset))
## # A tibble: 1 x 1
## n
## <int>
## 1 13
Count the number of observations per dataset. The count() function in dplyr performs group_by() on the specified column followed by summarise(n = n()), which returns the number of observations per defined group.
count(ds_dozen, dataset)
## # A tibble: 13 x 2
## dataset n
## <chr> <int>
## 1 away 142
## 2 bullseye 142
## 3 circle 142
## 4 dino 142
## 5 dots 142
## 6 h_lines 142
## 7 high_lines 142
## 8 slant_down 142
## 9 slant_up 142
## 10 star 142
## 11 v_lines 142
## 12 wide_lines 142
## 13 x_shape 142
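To see what count() is shorthand for, the sketch below spells out the group_by() and summarise(n = n()) steps and compares the result with the shortcut. The object names by_hand and shortcut are only illustrative, and the code assumes that the dplyr and datasauRus packages are installed.

```r
library(dplyr)
library(datasauRus)

# Long form: group by the dataset column, then count the rows of each group
by_hand <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(n = n())

# Short form: count() performs both steps at once
shortcut <- count(datasaurus_dozen, dataset)

# Compare the two results as plain data frames
all.equal(as.data.frame(by_hand), as.data.frame(shortcut))
```

The long form is worth remembering: it is the pattern to extend when something other than a row count is needed per group.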
Compute the mean and standard deviation of the x & y columns for each dataset. For this, you need to group_by() the appropriate column and then summarise(). In summarise() you can define as many new columns as you wish. No need to call it for every single variable.
Using across():
ds_dozen %>%
  group_by(dataset) %>%
  # across() takes first the columns to work on, second the function(s) to apply to that selection
  # 2 possibilities to select columns:
  # summarise(across(where(is.double), list(mean = mean, sd = sd)))
  summarise(across(c(x, y), list(mean = mean, sd = sd)))
## # A tibble: 13 x 5
## dataset x_mean x_sd y_mean y_sd
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 away 54.3 16.8 47.8 26.9
## 2 bullseye 54.3 16.8 47.8 26.9
## 3 circle 54.3 16.8 47.8 26.9
## 4 dino 54.3 16.8 47.8 26.9
## 5 dots 54.3 16.8 47.8 26.9
## 6 h_lines 54.3 16.8 47.8 26.9
## 7 high_lines 54.3 16.8 47.8 26.9
## 8 slant_down 54.3 16.8 47.8 26.9
## 9 slant_up 54.3 16.8 47.8 26.9
## 10 star 54.3 16.8 47.8 26.9
## 11 v_lines 54.3 16.8 47.8 26.9
## 12 wide_lines 54.3 16.8 47.8 26.9
## 13 x_shape 54.3 16.8 47.8 26.9
They all look similar based on summary stats: the mean and sd are identical, to the displayed precision, in all datasets.
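As a point of comparison, here is a minimal sketch of the same summary written without across(), naming each new column by hand inside a single summarise() call. The name stats_explicit is only illustrative, and the code assumes that dplyr and datasauRus are installed.

```r
library(dplyr)
library(datasauRus)

# One summarise() call can create several columns, each named explicitly
stats_explicit <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarise(
    x_mean = mean(x),
    x_sd   = sd(x),
    y_mean = mean(y),
    y_sd   = sd(y)
  )

stats_explicit
```

The explicit form is easier to read with two variables; across() scales better once many columns need the same treatment.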
Plot ds_dozen with ggplot such that the aesthetics are aes(x = x, y = y), with the geometry geom_point().
The ggplot() and geom_point() functions must be linked with a + sign.
ds_dozen %>%
  ggplot(aes(x = x, y = y)) +
  geom_point()
Colour the points by the dataset column
ds_dozen %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point()
Too many datasets are displayed.
You can filter for one dataset upstream of plotting:
ds_dozen %>%
  filter(dataset == 'away') %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point()
R provides the infix operator %in% to test whether each element of the left operand matches an element of the right one (most probably a vector).
ds_dozen %>%
  filter(dataset %in% c('away', 'bullseye')) %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point()
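The behaviour of %in% is easiest to see on a plain vector, outside of filter(). The sketch below uses two made-up vectors, candidates and wanted, whose names are purely illustrative.

```r
# A small character vector to test, and the set of values to look for
candidates <- c("away", "dino", "bullseye")
wanted <- c("away", "bullseye")

# %in% returns one logical value per element of the left operand
candidates %in% wanted
## [1]  TRUE FALSE  TRUE

# == compares against a single value here, so it cannot test several names at once
candidates == "away"
## [1]  TRUE FALSE FALSE
```

filter() keeps the rows for which this logical vector is TRUE, which is why %in% is the natural choice when selecting several datasets.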
Display one dataset per facet. Faceting splits the plot into panels, separating the datasets according to a variable.
ds_dozen %>%
  filter(dataset %in% c('away', 'bullseye')) %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(~ dataset)
ds_dozen %>%
  ggplot(aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(~ dataset)
Apply theme_void and remove the legend. Here the legend disappears because the colour aesthetic is dropped.
ds_dozen %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ dataset) +
  theme_void()
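If you would rather keep one colour per dataset, the legend can also be hidden explicitly. The following is a minimal sketch of that alternative, not the solution used in this practical; the object name p is illustrative, and the code assumes that ggplot2 and datasauRus are installed.

```r
library(ggplot2)
library(datasauRus)

# Keep one colour per dataset, but hide the now redundant legend
p <- ggplot(datasaurus_dozen, aes(x = x, y = y, colour = dataset)) +
  geom_point() +
  facet_wrap(~ dataset) +
  # theme_void() is a complete theme, so the legend tweak must come after it
  theme_void() +
  theme(legend.position = "none")

p
```

The facet titles already name each dataset, so the legend carries no extra information in this layout.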
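No, the summary stats can be misleading: the datasets share the same means and standard deviations but look completely different once plotted.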
The R package gifski can be installed on your machine; it makes the GIF creation faster. gifski is internally written in Rust, and building it requires cargo, the Rust build tool. See this article to get it installed on your machine. Install Rust first, before installing the R package gifski. Please note that the animate() step still takes about 3-5 minutes depending on your machine.
Install gganimate; its dependencies will be automatically installed.
# install.packages("gganimate")
# Rust is installed outside of R (see the article mentioned above), not with install.packages()
# install.packages("gifski")
Pass the dataset variable to the transition_states() layer
library(gganimate)
ds_anim <- ds_dozen %>%
  ggplot(aes(x = x, y = y)) +
  geom_point() +
  # transition will be made using the dataset column
  transition_states(dataset, transition_length = 5, state_length = 2) +
  # for a firework effect!
  shadow_wake(wake_length = 0.05) +
  labs(title = "dataset: {closest_state}") +
  theme_void(14) +
  theme(legend.position = "none")
# more frames to slow down the animation
ds_gif <- animate(ds_anim, nframes = 500, fps = 10, renderer = gifski_renderer())
ds_gif